NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

HARP 3.0: Generalizing I/O and API Support for Machine Learning in Digital Audio Workstations

Cwitkowitz, Frank; Benetatos, Christodoulos; Deng, Qixin; Yu, Huiran; Pruyne, Nathan; O’Reilly, Patrick; Garcia, Hugo Flores; Duan, Zhiyao; Pardo, Bryan (December 2025, NeurIPS 2025 Workshop on AI for Music)

Free, publicly-accessible full text available December 1, 2026
Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects

https://doi.org/10.1109/ICASSP49660.2025.10890334

Chu, Annie; O’Reilly, Patrick; Barnett, Julia; Pardo, Bryan (April 2025, IEEE)

This work introduces Text2FX, a method that leverages CLAP embeddings and differentiable digital signal processing to control audio effects, such as equalization and reverberation, using open-vocabulary natural language prompts (e.g., "make this sound in-your-face and bold"). Text2FX operates without retraining any models, relying instead on single-instance optimization within the existing embedding space, thus enabling a flexible, scalable approach to open-vocabulary sound transformations through interpretable and disentangled FX manipulation. We show that CLAP encodes valuable information for controlling audio effects and propose two optimization approaches using CLAP to map text to audio effect parameters. While we demonstrate with CLAP, this approach is applicable to any shared text-audio embedding space. Similarly, while we demonstrate with equalization and reverberation, any differentiable audio effect may be controlled. We conduct a listener study with diverse text prompts and source audio to evaluate the quality and alignment of these methods with human perception. Demos and code are available at anniejchu.github.io/text2fx.
Free, publicly-accessible full text available April 6, 2026
Code Drift: Towards Idempotent Neural Audio Codecs

https://doi.org/10.1109/ICASSP49660.2025.10890096

O’Reilly, Patrick; Seetharaman, Prem; Su, Jiaqu; Jin, Zeyu; Pardo, Bryan (April 2025, IEEE)

Neural codecs have demonstrated strong performance in high-fidelity compression of audio signals at low bitrates. The token-based representations produced by these codecs have proven particularly useful for generative modeling. While much research has focused on improvements in compression ratio and perceptual transparency, recent works have largely overlooked another desirable codec property: idempotence, the stability of compressed outputs under multiple rounds of encoding. We find that state-of-the-art neural codecs exhibit varied degrees of idempotence, with some degrading audio outputs significantly after as few as three encodings. We investigate possible causes of low idempotence and devise a method for improving idempotence through fine-tuning a codec model. We then examine the effect of idempotence on a simple conditional generative modeling task, and find that increased idempotence can be achieved without negatively impacting downstream modeling performance, potentially extending the usefulness of neural codecs for practical file compression and iterative generative modeling workflows.
Free, publicly-accessible full text available April 6, 2026
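
The idempotence property named in the Code Drift abstract above can be illustrated with a small sketch. This is not the authors' method or any real neural codec: both "codecs" below are toy stand-ins invented for illustration, and the function names (`quantize_codec`, `smoothing_codec`, `drift`), the step size, and the test signal are all assumptions. The sketch only shows what it means to measure how far an output moves when it is re-encoded several times.

```python
import math

def quantize_codec(signal, step=0.125):
    """Toy codec (hypothetical): round each sample to the nearest multiple of step."""
    return [round(x / step) * step for x in signal]

def smoothing_codec(signal, step=0.125):
    """Toy lossy codec (hypothetical): 3-point moving average, then quantize."""
    smoothed = []
    for i in range(len(signal)):
        window = signal[max(0, i - 1): i + 2]  # truncated at the edges
        smoothed.append(sum(window) / len(window))
    return quantize_codec(smoothed, step)

def rms(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def drift(codec, signal, rounds):
    """RMS distance from the first-pass output after each additional pass."""
    first = codec(signal)
    current = first
    errors = []
    for _ in range(rounds):
        current = codec(current)
        errors.append(rms(current, first))
    return errors

# Assumed test signal: four periods of a sine wave, 16 samples per period.
signal = [math.sin(2 * math.pi * i / 16) for i in range(64)]

quant_drift = drift(quantize_codec, signal, rounds=3)
smooth_drift = drift(smoothing_codec, signal, rounds=3)
print("quantizer drift:", quant_drift)
print("smoothing codec drift:", smooth_drift)
```

The plain quantizer is idempotent, so its drift stays at zero on every pass, while the smoothing codec keeps changing its own output when re-applied. A real evaluation would substitute an actual codec's encode and decode calls and a perceptual metric for the RMS distance used here.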
Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations

https://doi.org/10.1109/ICASSP49660.2025.10888184

Flores García, Hugo; Nieto, Oriol; Salamon, Justin; Pardo, Bryan; Seetharaman, Prem (April 2025, IEEE)

We present Sketch2Sound, a generative audio model capable of creating high-quality sounds from a set of interpretable time-varying control signals: loudness, brightness, and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic imitations (i.e., a vocal imitation or a reference sound-shape). Sketch2Sound can be implemented on top of any text-to-audio latent diffusion transformer (DiT), and requires only 40k steps of fine-tuning and a single linear layer per control, making it more lightweight than existing methods like ControlNet. To synthesize from sketchlike sonic imitations, we propose applying random median filters to the control signals during training, allowing Sketch2Sound to be prompted using controls with flexible levels of temporal specificity. We show that Sketch2Sound can synthesize sounds that follow the gist of input controls from a vocal imitation while retaining the adherence to an input text prompt and audio quality compared to a text-only baseline. Sketch2Sound allows sound artists to create sounds with the semantic flexibility of text prompts and the expressivity and precision of a sonic gesture or vocal imitation.
Free, publicly-accessible full text available April 6, 2026
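
The Sketch2Sound abstract above mentions applying random median filters to control signals during training. A minimal sketch of that idea follows. The abstract does not give window sizes, edge handling, or sampling details, so the window choices, the truncated-edge behavior, and the function names (`median_filter`, `random_median_filter`) are all assumptions made here for illustration, not the authors' implementation.

```python
import random
from statistics import median

def median_filter(signal, window):
    """Median-filter a 1-D sequence with an odd window (edges use truncated windows)."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    return [
        median(signal[max(0, i - half): i + half + 1])
        for i in range(len(signal))
    ]

def random_median_filter(signal, window_choices=(1, 5, 11, 21), rng=random):
    """Pick a window size at random (assumed choices) and apply the median filter."""
    window = rng.choice(window_choices)
    return window, median_filter(signal, window)

# Assumed control curve: a flat loudness contour with a single-sample spike.
control = [0.0] * 10 + [1.0] + [0.0] * 10

unfiltered = median_filter(control, 1)   # window of 1 leaves the curve unchanged
smoothed = median_filter(control, 5)     # window of 5 removes the isolated spike

chosen_window, randomized = random_median_filter(control, rng=random.Random(0))
print("window:", chosen_window)
print("filtered:", randomized)
```

Larger windows discard more fine temporal detail, so sampling the window size at random exposes a model to controls ranging from precise to coarse, which is one plausible reading of the "flexible levels of temporal specificity" the abstract describes.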
HARP 2.0: Expanding Hosted, Asynchronous, Remote Processing for Deep Learning in the DAW

Benetatos, Christodoulos; Cwitkowitz, Frank; Pruyne, Nathan; Garcia, Hugo Flores; O’Reilly, Patrick; Duan, Zhiyao; Pardo, Bryan (November 2024, ISMIR 2024 Late Breaking and Demo)

Full Text Available
HARP: Bringing Deep Learning to the DAW with Hosted, Asynchronous, Remote Processing

Garcia, Hugo Flores; Benetatos, Christodoulos; O'Reilly, Patrick; Aguilar, Aldo; Duan, Zhiyao; Pardo, Bryan (December 2023, NeurIPS 2023 Workshop on Machine Learning for Creativity and Design)

Full Text Available
Deep Learning Tools for Audacity: Helping Researchers Expand the Artist's Toolkit

https://doi.org/10.48550/arXiv.2110.13323

Flores Garcia, Hugo; Aguilar, Aldo; Manilow, Ethan; Vedenko, Dmitry; Pardo, Bryan (October 2021, 5th Workshop on Machine Learning for Creativity and Design at NeurIPS 2021)

We present a software framework that integrates neural networks into the popular open-source audio editing software, Audacity, with a minimal amount of developer effort. In this paper, we showcase some example use cases for both end-users and neural network developers. We hope that this work fosters a new level of interactivity between deep learning practitioners and end-users.
Full Text Available
Leveraging Hierarchical Structures for Few-Shot Musical Instrument Recognition

https://doi.org/10.48550/arXiv.2107.07029

Flores Garcia, Hugo; Aguilar, Aldo; Manilow, Ethan; Pardo, Bryan (August 2021, Proc. of the 22nd Int. Society for Music Information Retrieval Conference)

Full Text Available
